Abstract

We introduce a framework for filtering features
that employs the Hilbert-Schmidt Independence
Criterion (HSIC) as a measure
of dependence between the features and the 
labels. The key idea is that good features 
should maximise such dependence. Fea- 
ture selection for various supervised learning 
problems (including classification and regres- 
sion) is unified under this framework, and 
the solutions can be approximated using a 
backward-elimination algorithm. We demon- 
strate the usefulness of our method on both 
artificial and real world datasets. 



1 Introduction 

In supervised learning problems, we are typically given
$m$ data points $x \in \mathcal{X}$ and their labels $y \in \mathcal{Y}$. The
task is to find a functional dependence between $x$ and
$y$, $f : \mathcal{X} \to \mathcal{Y}$, subject to certain optimality
conditions. Representative tasks include binary classification,
multi-class classification, regression and ranking.
We often want to reduce the dimension of the data (the
number of features) before the actual learning (Guyon
& Elisseeff, 2003); a larger number of features can be
associated with higher data collection cost, more difficulty
in model interpretation, higher computational
cost for the classifier, and decreased generalisation
ability. It is therefore important to select an informative
feature subset.

The problem of supervised feature selection can be 
cast as a combinatorial optimisation problem. We 
have a full set of features, denoted S (whose elements 
correspond to the dimensions of the data). We use 
these features to predict a particular outcome, for 
instance the presence of cancer: clearly, only a subset 
T of features will be relevant. Suppose the relevance 
of T to the outcome is quantified by Q(T), and 
is computed by restricting the data to the dimen- 
sions in T. Feature selection can then be formulated as 



$$T_0 = \arg\max_{T \subseteq S} Q(T) \quad \text{subject to } |T| \le t, \qquad (1)$$



where $|\cdot|$ computes the cardinality of a set and $t$ upper
bounds the number of selected features. Two important
aspects of problem (1) are the choice of the
criterion $Q(T)$ and the selection algorithm.

Feature Selection Criterion. The choice of $Q(T)$
should respect the underlying supervised learning
task: estimate the dependence function $f$ from training
data and guarantee that $f$ predicts well on test data.
Therefore, good criteria should satisfy two conditions:

I: $Q(T)$ is capable of detecting any desired (nonlinear
as well as linear) functional dependence between
the data and labels.
II: $Q(T)$ is concentrated with respect to the underlying
measure. This guarantees with high probability
that the detected functional dependence is
preserved in the test data.

While many feature selection criteria have been explored,
few take these two conditions explicitly into
account. Examples of those that do include the
leave-one-out error bound of SVM (Weston et al., 2000)
and the mutual information (Koller & Sahami, 1996). Although
the latter has good theoretical justification, it requires
density estimation, which is problematic for high dimensional
and continuous variables. We sidestep
these problems by employing a mutual-information
like quantity, the Hilbert-Schmidt Independence
Criterion (HSIC) (Gretton et al., 2005). HSIC uses
kernels for measuring dependence and does not require
density estimation. HSIC also has good uniform convergence
guarantees. As we show in Section 2, HSIC
satisfies conditions I and II, required for $Q(T)$.

Feature Selection Algorithm. Finding a global
optimum for (1) is in general NP-hard (Weston et al.,
2003). Many algorithms transform (1) into a continuous
problem by introducing weights on the dimensions
(Weston et al., 2000, 2003). These methods perform
well for linearly separable problems. For nonlinear
problems, however, the optimisation usually becomes
non-convex and a local optimum does not necessarily
provide good features. Greedy approaches (forward
selection and backward elimination) are often used to
tackle problem (1) directly. Forward selection tries to
increase $Q(T)$ as much as possible for each inclusion of
features, and backward elimination tries to achieve this
for each deletion of features (Guyon et al., 2002). Although
forward selection is computationally more efficient,
backward elimination provides better features
in general since the features are assessed within the
context of all others.

BAHSIC. In principle, HSIC can be employed using 
either the forwards or backwards strategy, or a mix of 
strategies. However, in this paper, we will focus on 
a backward elimination algorithm. Our experiments 
show that backward elimination outperforms forward 
selection for HSIC. Backward elimination using HSIC 
(BAHSIC) is a filter method for feature selection. It 
selects features independent of a particular classifier. 
Such decoupling not only facilitates subsequent feature 
interpretation but also speeds up the computation over 
wrapper and embedded methods. 

Furthermore, BAHSIC is directly applicable to binary, 
multiclass, and regression problems. Most other fea- 
ture selection methods are only formulated either for 
binary classification or regression. The multi-class ex- 
tension of these methods is usually accomplished us- 
ing a one-versus-the-rest strategy. Still fewer methods 
handle classification and regression cases at the same 
time. BAHSIC, on the other hand, accommodates all 



these cases in a principled way: by choosing different 
kernels, BAHSIC also subsumes many existing meth- 
ods as special cases. The versatility of BAHSIC origi- 
nates from the generality of HSIC. Therefore, we begin 
our exposition with an introduction of HSIC. 

2 Measures of Dependence 

We define $\mathcal{X}$ and $\mathcal{Y}$ broadly as two domains from which
we draw samples $(x, y)$: these may be real valued, vector
valued, class labels, strings, graphs, and so on. We
define a (possibly nonlinear) mapping $\phi(x) \in \mathcal{F}$ from
each $x \in \mathcal{X}$ to a feature space $\mathcal{F}$, such that the inner
product between the features is given by a kernel
function $k(x, x') := \langle \phi(x), \phi(x') \rangle$; $\mathcal{F}$ is called a
reproducing kernel Hilbert space (RKHS). Likewise, let $\mathcal{G}$
be a second RKHS on $\mathcal{Y}$ with kernel $l(\cdot, \cdot)$ and feature
map $\psi(y)$. We may now define a cross-covariance operator
between these feature maps, in accordance with
Baker (1973); Fukumizu et al. (2004): this is a linear
operator $C_{xy} : \mathcal{G} \to \mathcal{F}$ such that

$$C_{xy} = \mathbf{E}_{xy}[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)], \qquad (2)$$

where $\otimes$ is the tensor product. The square of the
Hilbert-Schmidt norm of the cross-covariance operator
(HSIC), $\|C_{xy}\|^2_{HS}$, is then used as our feature selection
criterion $Q(T)$. Gretton et al. (2005) show that HSIC
can be expressed in terms of kernels as



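$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) = \|C_{xy}\|^2_{HS}$$
$$= \mathbf{E}_{xx'yy'}[k(x,x')\,l(y,y')] + \mathbf{E}_{xx'}[k(x,x')]\,\mathbf{E}_{yy'}[l(y,y')] - 2\,\mathbf{E}_{xy}\big[\mathbf{E}_{x'}[k(x,x')]\,\mathbf{E}_{y'}[l(y,y')]\big], \qquad (3)$$

where $\mathbf{E}_{xx'yy'}$ is the expectation over both $(x, y) \sim \Pr_{xy}$
and an additional pair of variables $(x', y') \sim \Pr_{xy}$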
drawn independently according to the same law. Pre- 
vious work used HSIC to measure independence be- 



tween two sets of random variables (Gretton et al. 



2005). Here we use it to select a subset T from the 



first full set of random variables S. We now describe 
further properties of HSIC which support its use as a 
feature selection criterion. 



Property (I) Gretton et al. (2005, Theorem 4) show
that whenever $\mathcal{F}$, $\mathcal{G}$ are RKHSs with universal kernels
$k$, $l$ on respective compact domains $\mathcal{X}$ and $\mathcal{Y}$ in the
sense of Steinwart (2002), then $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) = 0$
if and only if $x$ and $y$ are independent. In terms of
feature selection, a universal kernel such as the Gaussian
RBF kernel or the Laplace kernel permits HSIC
to detect any dependence between $\mathcal{X}$ and $\mathcal{Y}$. HSIC is
zero if and only if features and labels are independent.

In fact, non-universal kernels can also be used for 
HSIC, although they may not guarantee that all dependencies
are detected. Different kernels incorporate
distinctive prior knowledge into the dependence esti- 
mation, and they focus HSIC on dependence of a cer- 
tain type. For instance, a linear kernel requires HSIC 
to seek only second order dependence. Clearly HSIC is 
capable of finding and exploiting dependence of a much 
more general nature by kernels on graphs, strings, or 
other discrete domains. 

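Property (II) Given a sample $Z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
of size $m$ drawn from $\Pr_{xy}$, we derive an unbiased
estimate of HSIC,

$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z) = \frac{1}{m(m-3)} \left[ \operatorname{tr}(KL) + \frac{\mathbf{1}^\top K \mathbf{1}\, \mathbf{1}^\top L \mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\, \mathbf{1}^\top K L \mathbf{1} \right], \qquad (4)$$

where $K$ and $L$ are computed as $K_{ij} = (1 - \delta_{ij})\, k(x_i, x_j)$
and $L_{ij} = (1 - \delta_{ij})\, l(y_i, y_j)$. Note that
the diagonal entries of $K$ and $L$ are set to zero. The
following theorem, a formal statement that the empirical
HSIC is unbiased, is proved in the appendix.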

Theorem 1 (HSIC is Unbiased) Let $\mathbf{E}_Z$ denote
the expectation taken over $m$ independent observations
$(x_i, y_i)$ drawn from $\Pr_{xy}$. Then

$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) = \mathbf{E}_Z[\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)]. \qquad (5)$$

This property contrasts with the mutual information,
which can require sophisticated bias correction
strategies (e.g. Nemenman et al., 2002).
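
As an illustration only, the estimator in (4) can be sketched in a few lines of Python with NumPy. The function names `hsic_unbiased` and `gaussian_kernel` are assumptions made for this sketch and are not taken from the authors' code; the Gaussian kernel uses the parametrisation $\exp(-\sigma \|x - x'\|^2)$ that appears later in Section 3.

```python
import numpy as np

def hsic_unbiased(K, L):
    """Unbiased HSIC estimate from two symmetric m x m kernel matrices, as in (4)."""
    K = np.array(K, dtype=float)   # np.array copies, so the inputs are not modified
    L = np.array(L, dtype=float)
    m = K.shape[0]
    if m < 4:
        raise ValueError("the estimator needs at least 4 samples")
    # K_ij = (1 - delta_ij) k(x_i, x_j): zero the diagonals.
    np.fill_diagonal(K, 0.0)
    np.fill_diagonal(L, 0.0)
    ones = np.ones(m)
    trace_term = np.sum(K * L)     # equals tr(KL) because L is symmetric
    total_term = (ones @ K @ ones) * (ones @ L @ ones) / ((m - 1) * (m - 2))
    cross_term = 2.0 * (ones @ K @ L @ ones) / (m - 2)
    return (trace_term + total_term - cross_term) / (m * (m - 3))

def gaussian_kernel(X, sigma):
    """k(x, x') = exp(-sigma * ||x - x'||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sigma * np.maximum(d2, 0.0))

# Usage on synthetic data (values are arbitrary and only illustrate the call).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(50, 1))
estimate = hsic_unbiased(gaussian_kernel(X, 0.5), gaussian_kernel(y, 0.5))
```

Because the estimator is unbiased rather than non-negative by construction, small negative values can occur when the dependence is weak.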



U-Statistics. The estimator in (4) can be alternatively
formulated using U-statistics,

$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z) = (m)_4^{-1} \sum_{(i,j,q,r) \in \mathbf{i}_4^m} h(i, j, q, r), \qquad (6)$$

where $(m)_n = \frac{m!}{(m-n)!}$ is the Pochhammer coefficient
and where $\mathbf{i}_r^m$ denotes the set of all $r$-tuples drawn
without replacement from $\{1, \ldots, m\}$. The kernel $h$ of
the U-statistic is defined by

$$h(i, j, q, r) = \frac{1}{4!} \sum_{(s,t,u,v)}^{(i,j,q,r)} \left[ K_{st} L_{st} + K_{st} L_{uv} - 2 K_{st} L_{su} \right], \qquad (7)$$

where the sum represents all ordered quadruples
$(s, t, u, v)$ selected without replacement from $(i, j, q, r)$.
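
The U-statistic form can be checked numerically against the closed form in (4) on a small sample. The following brute-force sketch is an illustration under our own naming (`hsic_u_statistic` is hypothetical); it costs $O(m^4)$ and is only meant for verifying small cases, not for use inside a feature selector.

```python
import itertools
import math
import numpy as np

def hsic_u_statistic(K, L):
    """Brute-force evaluation of (6) with the kernel h of (7); small m only."""
    K = np.array(K, dtype=float)
    L = np.array(L, dtype=float)
    np.fill_diagonal(K, 0.0)
    np.fill_diagonal(L, 0.0)
    m = K.shape[0]

    def h(i, j, q, r):
        # Average over the 4! orderings (s, t, u, v) of (i, j, q, r).
        acc = 0.0
        for s, t, u, v in itertools.permutations((i, j, q, r)):
            acc += K[s, t] * L[s, t] + K[s, t] * L[u, v] - 2.0 * K[s, t] * L[s, u]
        return acc / 24.0

    # h is symmetric in its arguments, so averaging over unordered 4-subsets
    # gives the same value as averaging over all ordered 4-tuples in (6).
    total = sum(h(*idx) for idx in itertools.combinations(range(m), 4))
    return total / math.comb(m, 4)
```

On any pair of symmetric kernel matrices with at least four rows, the value returned here should agree with the closed form (4) up to floating-point error.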

We now show that $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$ is concentrated.
Furthermore, its convergence in probability to
$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy})$ occurs with rate $1/\sqrt{m}$, which is a
slight improvement over the convergence of the biased
estimator by Gretton et al. (2005).

Theorem 2 (HSIC is Concentrated) Assume $k$, $l$
are bounded almost everywhere by 1, and are non-negative.
Then for $m > 1$ and all $\delta > 0$, with probability
at least $1 - \delta$ for all $\Pr_{xy}$

$$|\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z) - \mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy})| \le 8 \sqrt{\log(2/\delta)/m}.$$

By virtue of (6) we see immediately that HSIC is a
U-statistic of order 4, where each term is bounded in
$[-2, 2]$. Applying Hoeffding's bound as in Gretton et al.
(2005) proves the result.

These two theorems imply the empirical HSIC closely
reflects its population counterpart. This means
the same features should consistently be selected to
achieve high dependence if the data are repeatedly
drawn from the same distribution.

Asymptotic Normality. It follows from Serfling
(1980) that under the assumptions $\mathbf{E}(h^2) < \infty$ and
that the data and labels are not independent, the empirical
HSIC converges in distribution to a Gaussian
random variable with mean $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy})$ and variance

$$\sigma^2_{\mathrm{HSIC}} = \frac{16}{m} \left( R - \mathrm{HSIC}^2 \right), \quad \text{where} \quad R = \frac{1}{m} \sum_{i=1}^{m} \Big( (m-1)_3^{-1} \sum_{(j,q,r) \in \mathbf{i}_3^m \setminus \{i\}} h(i, j, q, r) \Big)^2, \qquad (8)$$

and $\mathbf{i}_r^m \setminus \{i\}$ denotes the set of all $r$-tuples drawn without
replacement from $\{1, \ldots, m\} \setminus \{i\}$. The asymptotic
normality allows us to formulate statistics for a
significance test. This is useful because it may provide
an assessment of the dependence between the selected
features and the labels.
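
To give a feel for the scale of the deviation bound in Theorem 2, the following sketch evaluates its right-hand side and inverts it for the sample size. The function names are hypothetical, and restricting $\delta$ to $(0, 1)$ is our own choice so that it reads as a failure probability.

```python
import math

def hsic_deviation_bound(m, delta):
    """Right-hand side of Theorem 2: 8 * sqrt(log(2 / delta) / m)."""
    if m <= 1 or not (0.0 < delta < 1.0):
        raise ValueError("need m > 1 and 0 < delta < 1")
    return 8.0 * math.sqrt(math.log(2.0 / delta) / m)

def samples_for_accuracy(eps, delta):
    """Smallest m for which hsic_deviation_bound(m, delta) <= eps."""
    return math.ceil(64.0 * math.log(2.0 / delta) / eps ** 2)

# Quadrupling the sample size halves the bound, reflecting the 1/sqrt(m) rate.
ratio = hsic_deviation_bound(400, 0.05) / hsic_deviation_bound(100, 0.05)
```

The constants make the bound loose for the sample sizes used in practice, so it is best read as a statement about the rate rather than as a usable confidence interval.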

Simple Computation. Note that $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$ is
simple to compute, since only the kernel matrices $K$
and $L$ are needed, and no density estimation is involved.
For feature selection, $L$ is fixed through the
whole process. It can be precomputed and stored for
speedup if needed. Note also that $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$ does
not need any explicit regularisation parameter. This
is encapsulated in the choice of the kernels.

3 Feature Selection via HSIC 

Having defined our feature selection criterion, we now 
describe an algorithm that conducts feature selection 
on the basis of this dependence measure. Using HSIC, 
we can perform both backward (BAHSIC) and for- 
ward (FOHSIC) selection of the features. In particu- 
lar, when we use a linear kernel on the data (there is 
no such requirement for the labels), forward selection
and backward selection are equivalent: the objective
function decomposes into individual coordinates, and 
thus feature selection can be done without recursion in 
one go. Although forward selection is computationally 
more efficient, backward elimination in general yields 
better features, since the quality of the features is as- 
sessed within the context of all other features. Hence 
we present the backward elimination version of our al- 
gorithm here (a forward greedy selection version can 
be derived similarly). 

BAHSIC appends the features from $S$ to the end of a
list $S^\dagger$ so that the elements towards the end of $S^\dagger$ have
higher relevance to the learning task. The feature selection
problem in (1) can be solved by simply taking
the last $t$ elements from $S^\dagger$. Our algorithm produces
$S^\dagger$ recursively, eliminating the least relevant features
from $S$ and adding them to the end of $S^\dagger$ at each
iteration. For convenience, we also denote HSIC as
$\mathrm{HSIC}(\sigma, S)$, where $S$ are the features used in computing
the data kernel matrix $K$, and $\sigma$ is the parameter
for the data kernel (for instance, this might be the size
of a Gaussian kernel $k(x, x') = \exp(-\sigma \|x - x'\|^2)$).

Algorithm 1 BAHSIC

Input: The full set of features $S$
Output: An ordered set of features $S^\dagger$

1: $S^\dagger \leftarrow \emptyset$
2: repeat
3:   $\sigma \leftarrow \Xi$
4:   $\mathcal{I} \leftarrow \arg\max_{\mathcal{I}} \sum_{j \in \mathcal{I}} \mathrm{HSIC}(\sigma, S \setminus \{j\}), \quad \mathcal{I} \subset S$
5:   $S \leftarrow S \setminus \mathcal{I}$
6:   $S^\dagger \leftarrow S^\dagger \cup \mathcal{I}$
7: until $S = \emptyset$

Step 3 of the algorithm denotes a policy $\Xi$ for adapting
the kernel parameters, e.g. by optimising over
the possible parameter choices. In our experiments,
we typically normalize each feature separately to zero
mean and unit variance, and adapt the parameter
for a Gaussian kernel by setting $\sigma$ to $1/(2d)$, where
$d = |S| - 1$. If we have prior knowledge about the
type of nonlinearity, we can use a kernel with fixed
parameters for BAHSIC. In this case, step 3 can be
omitted.

Step 4 of the algorithm is concerned with the selection
of a set $\mathcal{I}$ of features to eliminate. While one could
choose a single element of $S$, this would be inefficient
when there are a large number of irrelevant features.
On the other hand, removing too many features at
once risks the loss of relevant features. In our experiments,
we found a good compromise between speed
and feature quality was to remove 10% of the current
features at each iteration.
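
The backward elimination loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the data are assumed to be already normalised, the label kernel matrix `L` is precomputed, the Gaussian kernel parameter follows the $1/(2d)$ rule described above, and at least one feature is removed per iteration even when 10% rounds down to zero.

```python
import numpy as np

def hsic_unbiased(K, L):
    """Unbiased HSIC estimate from symmetric kernel matrices (diagonals zeroed)."""
    K = np.array(K, dtype=float)
    L = np.array(L, dtype=float)
    np.fill_diagonal(K, 0.0)
    np.fill_diagonal(L, 0.0)
    m = K.shape[0]
    one = np.ones(m)
    return (np.sum(K * L)
            + (one @ K @ one) * (one @ L @ one) / ((m - 1) * (m - 2))
            - 2.0 * (one @ K @ L @ one) / (m - 2)) / (m * (m - 3))

def gaussian_kernel(X, sigma):
    """k(x, x') = exp(-sigma * ||x - x'||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sigma * np.maximum(d2, 0.0))

def bahsic(X, L, frac=0.1):
    """Return feature indices ordered from least to most relevant."""
    remaining = list(range(X.shape[1]))
    ordered = []
    while remaining:
        if len(remaining) == 1:
            # Nothing left to compare against: the last feature closes the list.
            ordered.extend(remaining)
            break
        sigma = 1.0 / (2.0 * (len(remaining) - 1))   # kernel parameter policy
        # Score feature j by the HSIC obtained when j is left out.
        scores = []
        for j in remaining:
            rest = [f for f in remaining if f != j]
            scores.append(hsic_unbiased(gaussian_kernel(X[:, rest], sigma), L))
        n_drop = max(1, int(frac * len(remaining)))
        # A high score means little dependence is lost without j, so j is dropped.
        drop_positions = np.argsort(scores)[::-1][:n_drop]
        dropped = [remaining[p] for p in drop_positions]
        ordered.extend(dropped)
        remaining = [f for f in remaining if f not in dropped]
    return ordered
```

With this ordering, the last `t` entries of the returned list are the selected features, for example `bahsic(X, L)[-5:]` for a subset of size 5. Each iteration costs one kernel matrix and one HSIC evaluation per remaining feature, which is why removing a fraction of the features per iteration matters for large feature sets.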

4 Connections to Other Approaches 

We now explore connections to other feature selectors.
For binary classification, an alternative criterion
for selecting features is to check whether the distributions
$\Pr(x \mid y = 1)$ and $\Pr(x \mid y = -1)$ differ. For
this purpose one could use Maximum Mean Discrepancy
(MMD) (Borgwardt et al., 2006). Likewise, one
could use Kernel Target Alignment (KTA) (Cristianini
et al., 2003) to test directly whether there exists any
correlation between data and labels. KTA has been
used for feature selection. Formally it is defined as
$\operatorname{tr} KL / (\|K\| \|L\|)$. For computational convenience the
normalisation is often omitted in practice (Neumann
et al., 2005), which leaves us with $\operatorname{tr} KL$. We discuss
this unnormalised variant below.

Let us consider the output kernel $l(y, y') = \rho(y)\rho(y')$,
where $\rho(1) = m_+^{-1}$ and $\rho(-1) = -m_-^{-1}$, and $m_+$ and
$m_-$ are the numbers of positive and negative samples,
respectively. With this kernel choice, we show that
MMD and KTA are closely related to HSIC. The following
theorem is proved in the appendix.

Theorem 3 (Connection to MMD and KTA)
Assume the kernel $k(x, x')$ for the data is bounded and
the kernel for the labels is $l(y, y') = \rho(y)\rho(y')$. Then

$$|\mathrm{HSIC} - (m-1)^{-2}\, \mathrm{MMD}| = O(m^{-1})$$
$$|\mathrm{HSIC} - (m-1)^{-2}\, \mathrm{KTA}| = O(m^{-1}).$$

This means selecting features that maximise HSIC also
maximises MMD and KTA. Note that in general (multiclass,
regression, or generic binary classification) this
connection does not hold.
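
The identities used in the proof of Theorem 3 are easy to verify numerically: with the label kernel $l(y, y') = \rho(y)\rho(y')$ the matrix $L$ satisfies $L\mathbf{1} = 0$, centring has no effect on $\operatorname{tr} KL$, and $\operatorname{tr} KL$ coincides with the biased MMD estimate. The sketch below assumes labels coded as $\pm 1$; the function names are illustrative only.

```python
import numpy as np

def label_kernel_binary(y):
    """L = rho rho^T with rho(+1) = 1/m_plus and rho(-1) = -1/m_minus."""
    y = np.asarray(y)
    m_plus = np.sum(y == 1)
    m_minus = np.sum(y == -1)
    rho = np.where(y == 1, 1.0 / m_plus, -1.0 / m_minus)
    return np.outer(rho, rho)

def mmd_biased(K, y):
    """Biased MMD estimate between the two classes, from the full kernel matrix K."""
    y = np.asarray(y)
    pos = y == 1
    neg = y == -1
    return (K[np.ix_(pos, pos)].mean() + K[np.ix_(neg, neg)].mean()
            - 2.0 * K[np.ix_(pos, neg)].mean())

# Synthetic example with unequal class sizes (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)
L = label_kernel_binary(y)
H = np.eye(len(y)) - np.ones((len(y), len(y))) / len(y)
centred_trace = np.trace(K @ H @ L @ H)
plain_trace = np.trace(K @ L)
```

Here `centred_trace`, `plain_trace` and `mmd_biased(K, y)` agree up to rounding, which is the content of the appendix argument in matrix form.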

5 Variants of BAHSIC 

New variants can be readily derived from BAHSIC by 
combining the two building blocks of BAHSIC: a ker- 
nel on the data and another one on the labels. Here 
we provide three examples using a Gaussian kernel on 
the data, while varying the kernel on the labels. This 
provides us with feature selectors for three problems: 

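Binary classification (BIN) We set $m_+^{-1}$ as the label
for positive class members, and $m_-^{-1}$ for negative
class members. We then apply a linear kernel.

Multiclass classification (MUL) We apply a linear
kernel on the labels using the label vectors below, as
described for a 3-class example. Here $m_i$ is the number
of samples in class $i$ and $\mathbf{1}_{m_i}$ denotes a vector of all
ones with length $m_i$.

$$Y = \begin{pmatrix}
\frac{\mathbf{1}_{m_1}}{m_1} & \frac{\mathbf{1}_{m_1}}{m_2 - m} & \frac{\mathbf{1}_{m_1}}{m_3 - m} \\
\frac{\mathbf{1}_{m_2}}{m_1 - m} & \frac{\mathbf{1}_{m_2}}{m_2} & \frac{\mathbf{1}_{m_2}}{m_3 - m} \\
\frac{\mathbf{1}_{m_3}}{m_1 - m} & \frac{\mathbf{1}_{m_3}}{m_2 - m} & \frac{\mathbf{1}_{m_3}}{m_3}
\end{pmatrix}_{m \times 3} \qquad (9)$$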



Regression (REG) A Gaussian RBF kernel is also
used on the labels. For convenience the kernel width $\sigma$
is fixed as the median distance between points in the
sample (Schölkopf & Smola, 2002).
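
The label kernels for the multiclass and regression variants can be sketched as below. Both functions are illustrations with hypothetical names. The multiclass construction follows the pattern of the label vectors in (9), where column $c$ holds $1/m_c$ for members of class $c$ and $1/(m_c - m)$ for all other samples, and it needs at least two classes. For the regression kernel, the text fixes the width as the median distance but does not spell out how the width enters the exponent, so the form $\exp(-d^2 / (2\,\mathrm{width}^2))$ is an assumption.

```python
import numpy as np

def multiclass_label_kernel(y):
    """Linear kernel L = Y Y^T on label vectors built as in (9)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    m = y.shape[0]
    # Column c: 1/m_c for members of class c, 1/(m_c - m) for everyone else.
    Y = np.where(y[:, None] == classes[None, :],
                 1.0 / counts[None, :],
                 1.0 / (counts[None, :] - m))
    return Y @ Y.T

def regression_label_kernel(y):
    """Gaussian RBF kernel on labels, width set to the median pairwise distance."""
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    d2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    dist = np.sqrt(d2)
    width = np.median(dist[np.triu_indices_from(dist, k=1)])
    return np.exp(-d2 / (2.0 * width ** 2))   # assumed parametrisation

L_mul = multiclass_label_kernel([0, 0, 0, 1, 1, 2])
L_reg = regression_label_kernel([0.0, 1.0, 3.0])
```

Each column of the multiclass label matrix sums to zero, so `L_mul` has zero row sums, mirroring the property $L\mathbf{1} = 0$ of the binary label kernel.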



For the above variants a further speedup of BAHSIC
is possible by updating the entries of the kernel matrix
incrementally, since we are using an RBF kernel. We
use the fact that $\|x - x'\|^2 = \sum_j \|x_j - x'_j\|^2$. Hence
$\|x - x'\|^2$ needs to be computed only once. Subsequent
updates are effected by subtracting $\|x_j - x'_j\|^2$
(the subscript here indexes the dimension).
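
A small sketch of this incremental update, with hypothetical function names: the full matrix of squared distances is computed once, and deleting a feature only requires subtracting that feature's contribution.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

def remove_features(D2, X, dropped):
    """Squared distances after deleting the feature columns listed in `dropped`."""
    for j in dropped:
        col = X[:, j]
        D2 = D2 - (col[:, None] - col[None, :]) ** 2
    return D2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
D2_full = squared_distances(X)                 # computed once
D2_reduced = remove_features(D2_full, X, [1, 3])
K_reduced = np.exp(-0.5 * D2_reduced)          # Gaussian kernel on features 0 and 2
```

Each removed feature costs $O(m^2)$ instead of the $O(m^2 d)$ needed to recompute all distances over $d$ remaining features.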

We will use BIN, MUL and REG as the particular in- 
stances of BAHSIC in our experiments. We will refer 
to them commonly as BAHSIC since the exact mean- 
ing will be clear depending on the datasets encoun- 
tered. Furthermore, we also instantiate FOHSIC us- 
ing the same kernels as BIN, MUL and REG, and we 
adopt the same convention when we refer to it in our 
experiments. 

6 Experimental Results 

We conducted three sets of experiments. The characteristics
of the datasets and the aims of the experiments
are: (i) artificial datasets illustrating the
properties of BAHSIC; (ii) real datasets that compare
BAHSIC with other methods; and (iii) a brain-computer
interface dataset showing that BAHSIC selects
meaningful features.

6.1 Artificial datasets 

We constructed 3 artificial datasets, shown in
Figure 1, to illustrate the difference between BAHSIC
variants with linear and nonlinear kernels. Each
dataset has 22 dimensions; only the first two dimensions
are related to the prediction task and the rest are
just Gaussian noise. These datasets are (i) Binary
XOR data: samples belonging to the same class have
multimodal distributions; (ii) Multiclass data: there
are 4 classes but 3 of them are collinear; (iii) Nonlinear
regression data: labels are related to the first
two dimensions of the data by $y = x_1 \exp(-x_1^2 - x_2^2) + \epsilon$,
where $\epsilon$ denotes additive Gaussian noise. We compare
BAHSIC to FOHSIC, Pearson's correlation, mutual
information (Zaffalon & Hutter, 2002), and RELIEF
(RELIEF works only for binary problems). We aim
to show that when nonlinear dependencies exist in the
data, BAHSIC with nonlinear kernels is very competent
in finding them.

Figure 1: Artificial datasets and the performance of different
methods when varying the number of observations.
Left column, top to bottom: Binary, multiclass, and
regression data. Different classes are encoded with different
colours. Right column: Median rank (y-axis) of
the two relevant features as a function of sample size
(x-axis) for the corresponding datasets in the left column.
(Blue circle: Pearson's correlation; Green triangle: RELIEF;
Magenta downward triangle: mutual information;
Black triangle: FOHSIC; Red square: BAHSIC.)

We instantiate the artificial datasets over a range of 
sample sizes (from 40 to 400), and plot the median 
rank, produced by various methods, for the first two 
dimensions of the data. All numbers in Figure 1 are
averaged over 10 runs. In all cases, BAHSIC shows 
good performance. More specifically, we observe: 

Binary XOR Both BAHSIC and RELIEF correctly 
select the first two dimensions of the data even for 
small sample sizes; while FOHSIC, Pearson's correla- 
tion, and mutual information fail. This is because the 
latter three evaluate the goodness of each feature inde- 
pendently. Hence they are unable to capture nonlinear 
interaction between features. 

Multiclass Data BAHSIC, FOHSIC and mutual in- 
formation select the correct features irrespective of the 
size of the sample. Pearson's correlation only works for 
large sample size. The collinearity of 3 classes provides
linear correlation between the data and the labels, but
due to the interference of the fourth class such correlation
is picked up by Pearson's correlation only for a
large sample size. 

Nonlinear Regression Data The performance 
of Pearson's correlation and mutual information is 
slightly better than random. BAHSIC and FOHSIC 
quickly converge to the correct answer as the sample 
size increases. 

In fact, we observe that as the sample size increases, 
BAHSIC is able to rank the relevant features (the first 
two dimensions) almost correctly in the first iteration 
(results not shown). While this does not prove BAH- 
SIC with nonlinear kernels is always better than that 
with a linear kernel, it illustrates the competence of 
BAHSIC in detecting nonlinear features. This is obviously
useful in real-world situations. The second
advantage of BAHSIC is that it is readily applicable to 
both classification and regression problems, by simply 
choosing a different kernel on the labels. 

6.2 Real world datasets 

Algorithms In this experiment, we show that the
performance of BAHSIC can be comparable to other
state-of-the-art feature selectors, namely SVM Recursive
Feature Elimination (RFE) (Guyon et al.,
2002), RELIEF (Kira & Rendell, 1992), L0-norm SVM
(L0) (Weston et al., 2003), and R2W2 (Weston et al.,
2000). We used the implementation of these algorithms
as given in the Spider machine learning toolbox,
since those were the only publicly available
implementations.^1 Furthermore, we also include filter methods,
namely FOHSIC, Pearson's correlation (PC), and mutual
information (MI), in our comparisons.

Datasets We used various real world datasets taken
from the UCI repository,^2 the Statlib repository,^3 the
LibSVM website,^4 and the NIPS feature selection
challenge^5 for comparison. Due to scalability issues in Spider,
we produced a balanced random sample of size less
than 2000 for datasets with more than 2000 samples.

Experimental Protocol We report the performance
of an SVM using a Gaussian kernel on a feature
subset of size 5 and 10-fold cross-validation. These 5
features were selected per fold using different methods.
Since we are comparing the selected features, we
used the same SVM for all methods: a Gaussian kernel
with $\sigma$ set as the median distance between points
in the sample (Schölkopf & Smola, 2002) and regularization
parameter $C = 100$. On classification datasets,
we measured the performance using the error rate, and
on regression datasets we used the percentage of variance
not-explained (also known as $1 - r^2$). The results
for binary datasets are summarized in the first part of
Table 1. Those for multiclass and regression datasets
are reported respectively in the second and the third
parts of Table 1.

^1 http://www.kyb.tuebingen.mpg.de/bs/people/spider
^2 http://www.ics.uci.edu/~mlearn/MLSummary.html
^3 http://lib.stat.cmu.edu/datasets/
^4 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

To provide a concise summary of the performance of
various methods on binary datasets, we measured how
the methods compare with the best performing one in
each dataset in Table 1. We recorded the best absolute
performance of all feature selectors as the baseline,
and computed the distance of each algorithm to
the best possible result. In this context it makes sense
to penalize catastrophic failures more than small deviations.
In other words, we would like to have a method
which is almost always very close to the best
performing one. Taking the $\ell_2$ distance achieves this
effect, by penalizing larger differences more heavily. It
is also our goal to choose an algorithm that performs
homogeneously well across all datasets. The $\ell_2$ distance
scores are listed for the binary datasets in Table 1.
In general, the smaller the $\ell_2$ distance, the better
the method. In this respect, BAHSIC and FOHSIC
have the best performance. We did not produce the $\ell_2$
distance for multiclass and regression datasets, since
the limited number of such datasets did not allow us
to draw statistically significant conclusions.
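
The summary score described above amounts to a few array operations. The sketch below uses a hypothetical function name and made-up error rates (they are not values from Table 1) purely to show the computation.

```python
import numpy as np

def l2_distance_to_best(errors):
    """errors[i, j] is the error of method i on dataset j; one score per method."""
    errors = np.asarray(errors, dtype=float)
    best = errors.min(axis=0)          # baseline: best result on each dataset
    return np.sqrt(((errors - best) ** 2).sum(axis=1))

# Made-up error rates for three methods on two datasets.
errors = [[10.0, 20.0],    # best on both datasets
          [13.0, 24.0],    # small deviations on both
          [10.0, 30.0]]    # best on one, large failure on the other
scores = l2_distance_to_best(errors)   # distances 0, 5 and 10
```

The second and third methods have total deviations of 7 and 10 under an $\ell_1$ sum, but 5 and 10 under the $\ell_2$ distance, so the single large failure is penalised relatively more heavily.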

6.3 Brain-computer interface dataset 

In this experiment, we show that BAHSIC selects features
that are meaningful in practice: we use BAHSIC
to select a frequency band for a brain-computer interface
(BCI) data set from the Berlin BCI group (Dornhege
et al., 2004). The data contains EEG signals
(118 channels, sampled at 100 Hz) from five healthy
subjects ('aa', 'al', 'av', 'aw' and 'ay') recorded during
two types of motor imaginations. The task is to
classify the imagination for individual trials.

Our experiment proceeded in 3 steps: (i) A Fast
Fourier transformation (FFT) was performed on each
channel and the power spectrum was computed. (ii)
The power spectra from all channels were averaged to
obtain a single spectrum for each trial. (iii) BAHSIC
was used to select the top 5 discriminative frequency
components based on the power spectrum. The
5 selected frequencies and their 4 nearest neighbours
were used to reconstruct the temporal signals (with all
other Fourier coefficients eliminated). The result was
then passed to a normal CSP method (Dornhege et al.,
2004) for feature extraction, and then classified using
a linear SVM.

Table 2: Classification errors (%) on BCI data after selecting
a frequency range.

Subject   aa         al        av         aw        ay
CSP       17.5±2.5   3.1±1.2   32.1±2.5   7.3±2.7   6.0±1.6
CSSP      14.9±2.9   2.4±1.3   33.0±2.7   5.4±1.9   6.2±1.5
CSSSP     12.2±2.1   2.2±0.9   31.8±2.8   6.3±1.8   12.7±2.0
BAHSIC    13.7±4.3   1.9±1.3   30.5±3.3   6.1±3.8   9.0±6.0

^5 http://clopinet.com/isabelle/Projects/NIPS2003/

Table 1: Classification error (%) or percentage of variance not-explained (%). The best result, and those results not
significantly worse than it, are highlighted in bold (one-sided Welch t-test with 95% confidence level). 100.0±0.0*:
program is not finished in a week or crashed. -: not applicable.

Figure 2: HSIC, encoded by the colour value for different frequency bands (axes correspond to upper and lower cutoff
frequencies). The figures, left to right, top to bottom correspond to subjects 'aa', 'al', 'av', 'aw' and 'ay'.

We compared automatic filtering using BAHSIC to
other filtering approaches: normal CSP method with
manual filtering (8-40 Hz), the CSSP method (Lemm
et al., 2005), and the CSSSP method (Dornhege et al.,
2006). All results presented in Table 2 are obtained
using 50 × 2-fold cross-validation. Our method is very
competitive and obtains first or second place for
4 of the 5 subjects. While the CSSP and the CSSSP
methods are specialised embedded methods (w.r.t. the
CSP method) for frequency selection on BCI data, our
method is entirely generic: BAHSIC decouples feature
selection from CSP.

In Figure 2 we use HSIC to visualise the responsiveness
of different frequency bands to motor imagination.
The horizontal and the vertical axes in each subfigure
represent the lower and upper bounds for a frequency
band, respectively. HSIC is computed for each
of these bands. Dornhege et al. (2006) report that the
$\mu$ rhythm (approx. 12 Hz) of EEG is most responsive to
motor imagination, and that the $\beta$ rhythm (approx. 22
Hz) is also responsive. We expect that HSIC will create
a strong peak at the $\mu$ rhythm and a weaker peak
at the $\beta$ rhythm, and the absence of other responsive
frequency components will create block patterns.
Both predictions are confirmed in Figure 2. Furthermore,
the large area of the red region for subject 'al'
indicates good responsiveness of his $\mu$ rhythm. This
also corresponds well with the lowest classification error
obtained for him in Table 2.

7 Conclusion 

This paper proposes a backward elimination procedure 
for feature selection using the Hilbert-Schmidt Inde- 
pendence Criterion (HSIC). The idea behind the re- 
sulting algorithm, BAHSIC, is to choose the feature 
subset that maximises the dependence between the 
data and labels. With this interpretation, BAHSIC 
provides a unified feature selection framework for any 
form of supervised learning. The absence of bias and 
good convergence properties of the empirical HSIC estimate
provide a strong theoretical justification for using
HSIC in this context. Although BAHSIC is a filter
method, it still demonstrates good performance compared
with more specialised methods in both artificial
and real world data. It is also very competitive in
terms of runtime performance.^6

Appendix

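Proof [Theorem 1] Recall that $K_{ii} = L_{ii} = 0$. We
prove the claim by constructing unbiased estimators
for each term in (3). Note that we have three types
of expectations, namely $\mathbf{E}_{xy}\mathbf{E}_{x'y'}$, a partially decoupled
expectation $\mathbf{E}_{xy}\mathbf{E}_{x'}\mathbf{E}_{y'}$, and $\mathbf{E}_{x}\mathbf{E}_{y}\mathbf{E}_{x'}\mathbf{E}_{y'}$, which
takes all four expectations independently.

If we want to replace the expectations by empirical averages,
we need to take care to avoid using the same
discrete indices more than once for independent random
variables. In other words, when taking expectations
over $r$ independent random variables, we need
$r$-tuples of indices where each index occurs exactly once.
The sets $\mathbf{i}_r^m$ satisfy this property. Their cardinalities
are given by the Pochhammer symbols $(m)_r$. Jointly
drawn random variables, on the other hand, share the
same index. We have

$$\mathbf{E}_{xy}\mathbf{E}_{x'y'}[k(x,x')\,l(y,y')] = \mathbf{E}_Z\Big[(m)_2^{-1} \sum_{(i,j) \in \mathbf{i}_2^m} K_{ij} L_{ij}\Big] = \mathbf{E}_Z\big[(m)_2^{-1} \operatorname{tr} KL\big].$$

In the case of the expectation over three independent
terms $\mathbf{E}_{xy}\mathbf{E}_{x'}\mathbf{E}_{y'}$ we obtain

$$\mathbf{E}_Z\Big[(m)_3^{-1} \sum_{(i,j,q) \in \mathbf{i}_3^m} K_{ij} L_{iq}\Big] = \mathbf{E}_Z\big[(m)_3^{-1} \big(\mathbf{1}^\top K L \mathbf{1} - \operatorname{tr} KL\big)\big].$$

For four independent random variables $\mathbf{E}_{x}\mathbf{E}_{y}\mathbf{E}_{x'}\mathbf{E}_{y'}$,

$$\mathbf{E}_Z\Big[(m)_4^{-1} \sum_{(i,j,q,r) \in \mathbf{i}_4^m} K_{ij} L_{qr}\Big] = \mathbf{E}_Z\big[(m)_4^{-1} \big(\mathbf{1}^\top K \mathbf{1}\, \mathbf{1}^\top L \mathbf{1} - 4\, \mathbf{1}^\top K L \mathbf{1} + 2 \operatorname{tr} KL\big)\big].$$

To obtain an expression for HSIC we only need to take
linear combinations using (3). Collecting terms related
to $\operatorname{tr} KL$, $\mathbf{1}^\top K L \mathbf{1}$, and $\mathbf{1}^\top K \mathbf{1}\, \mathbf{1}^\top L \mathbf{1}$ yields

$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) = \frac{1}{m(m-3)}\, \mathbf{E}_Z\left[ \operatorname{tr} KL + \frac{\mathbf{1}^\top K \mathbf{1}\, \mathbf{1}^\top L \mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\, \mathbf{1}^\top K L \mathbf{1} \right].$$

This is the expected value of $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$. ■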

Proof [Theorem 3] We first relate a biased estimator
of HSIC to the biased estimator of MMD. The former
is given by

$$\frac{1}{(m-1)^2} \operatorname{tr} KHLH \quad \text{where } H = I - m^{-1} \mathbf{1}\mathbf{1}^\top$$

and the bias is bounded by $O(m^{-1})$, as shown by Gretton
et al. (2005). An estimator of MMD with bias
$O(m^{-1})$ is

$$\mathrm{MMD} = \frac{1}{m_+^2} \sum_{i,j}^{m_+} k(x_i, x_j) + \frac{1}{m_-^2} \sum_{i,j}^{m_-} k(x_i, x_j) - \frac{2}{m_+ m_-} \sum_{i}^{m_+} \sum_{j}^{m_-} k(x_i, x_j) = \operatorname{tr} KL.$$

If we choose $l(y, y') = \rho(y)\rho(y')$ with $\rho(1) = m_+^{-1}$
and $\rho(-1) = -m_-^{-1}$, we can see $L\mathbf{1} = 0$. In this case
$\operatorname{tr} KHLH = \operatorname{tr} KL$, which shows that the biased estimators
of MMD and HSIC are identical up to a constant
factor. Since the bias of $\operatorname{tr} KHLH$ is $O(m^{-1})$,
this implies the same bias for the MMD estimate.

To see the same result for Kernel Target Alignment,
note that for equal class size the normalisations with
regard to $m_+$ and $m_-$ become irrelevant, which yields
the corresponding MMD term. ■

^6 Code is freely available as part of the Elefant package
at http://elefant.developer.nicta.com.au